Abstract

Interactive partially observable Markov decision processes
(I-POMDP) provide a formal framework for planning for a
self-interested agent in multiagent settings. An agent operating
in a multiagent environment must deliberate about the actions
that other agents may take and the effect these actions
have on the environment and the rewards it receives. Traditional
I-POMDPs model this dependence on the actions of
other agents using joint action and model spaces. Therefore,
the solution complexity grows exponentially with the number
of agents, thereby complicating scalability. In this paper,
we model and extend anonymity and context-specific independence
- problem structures often present in agent populations
- for computational gain. We empirically demonstrate the
efficiency from exploiting these problem structures by solving
a new multiagent problem involving more than 1,000 agents.


Introduction 

We focus on the decision-making problem of an individual
agent operating in the presence of other self-interested
agents whose actions may affect the state of the environment
and the subject agent's rewards. In stochastic and partially
observable environments, this problem is formalized by the
interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005).


I-POMDPs cover an important portion of the multiagent
planning problem space (Seuken and Zilberstein 2008;
Doshi 2012), and applications in diverse areas such as
security (Ng et al. 2010; Seymour and Peterson 2009),
robotics (Wang 2013; Woodward and Wood 2012),
ad hoc teams (Chandrasekaran et al. 2014) and human behavior
modeling (Doshi et al. 2010; Wunder et al. 2011) testify to
their wide appeal while critically motivating better scalability.

Previous I-POMDP solution approximations such as interactive
particle filtering (Doshi and Gmytrasiewicz 2009),
point-based value iteration (Doshi and Perez 2008)
and interactive bounded policy iteration
(I-BPI) (Sonu and Doshi 2014) scale I-POMDP solutions
to larger physical state, observation and model spaces.
Hoang and Low (2013) introduced the specialized
I-POMDP Lite framework that promotes efficiency by
modeling other agents as nested MDPs. However, to
the best of our knowledge no effort specifically scales
I-POMDPs to many interacting agents - say, a population
of more than a thousand - sharing the environment.

For illustration, consider the decision-making problem of
the police when faced with a large protest. The degree of the
police response is often decided by how many protestors of
which type (disruptive or not) are participating. The individual
identity of the protestor within each type seldom matters.
This key observation of frame-action anonymity motivates
how we model the agent population in the planning
process. Furthermore, the planned degree of response at a
protest site is influenced, in part, by how many disruptive
protestors are predicted to converge at the site and much less
by some other actions of protestors such as movement between
other distant sites. Therefore, police actions depend
on just a few actions of note for each type of agent.

The example above illustrates two known and powerful
types of problem structure in domains involving many
agents: action anonymity (Roughgarden and Tardos 2002)
and context-specific independence (Boutilier et al. 1996).
Action anonymity allows the exponentially large joint action
space to be substituted with a much more compact
space of action configurations where a configuration is a tuple
representing the number of agents performing each action.
Context-specific independence (wherein given a context
such as the state and agent's own action, not all actions
performed by other agents are relevant) permits the space
of configurations to be compressed by projecting counts
over a limited set of others' actions. We extend both action
anonymity and context-specific independence to allow
considerations of an agent's frame as well.¹

1 I-POMDPs distinguish between an agent's frame and type
with the latter including beliefs as well. Frames are similar in
semantics to the colloquial use of types.

We list the specific contributions of this paper below:

1. I-POMDPs are severely challenged by large numbers
of agents sharing the environment, which cause an exponential
growth in the space of joint models and actions.
Exploiting problem structure in the form of frame-action
anonymity and context-specific independence, we
present a new method for considerably scaling the solution
of I-POMDPs to an unprecedented number of agents.

2. We present a systematic way of modeling the problem
structure in transition, observation and reward functions,
and integrating it in a simple method for solving
I-POMDPs that models other agents using finite-state machines
and builds reachability trees given an initial belief.

3. We prove that the Bellman equation modified to include
action configurations and frame-action independences
continues to remain optimal given the I-POMDP with
explicated problem structure.

4. Finally, we theoretically verify the improved savings in
computational time and memory, and empirically demonstrate
them on a new problem of policing protest with over a
thousand protestors.

The above problem structure allows us to substantially
mitigate the curse of dimensionality whose acute impact on
I-POMDPs is well known. However, it does not lessen the
impact of the curse of history. In this context, an additional
step of sparse sampling of observations while generating the
reachability tree allows sophisticated planning with a population
of 1,000+ agents in about six hours.


Related Work 


Building on graphical games (Kearns, Littman, and Singh 2001),
action graph games (AGG) (Jiang, Leyton-Brown, and Bhat 2011)
utilize problem structures such as action anonymity and
context-specific independence to concisely represent single-shot
complete-information games involving multiple agents
and to scalably solve for Nash equilibrium. The independence
is modeled using a directed action graph whose nodes
are actions, and an edge between two nodes indicates that
the reward of an agent performing an action indicated by
one node is affected by other agents performing the action
of the other node. Lack of edges between nodes encodes
the context-specific independence where the context is the
action. Action anonymity is useful when the action sets of
agents overlap substantially. Subsequently, the vector of
counts over the set of distinct actions, called a configuration,
is much smaller than the space of action profiles.

We substantially build on AGGs in this paper by extending
anonymity and context-specific independence to include
agent frames, and generalizing their use to a partially
observable stochastic game solved using decision-theoretic
planning as formalized by I-POMDPs. Indeed,
Bayesian AGGs (Jiang and Leyton-Brown 2010) extend the
original formulation to include agent types. These result
in type-specific action sets with the benefit that the action
graph structure does not change although the number
of nodes grows with types: $|\hat{\Theta}||A|$ nodes for agents with
$|\hat{\Theta}|$ types each having the same $|A|$ actions. If two actions
from different type-action sets share a node, then these actions
are interchangeable. A key difference in our representation
is that we explicitly model frames in the graphs
due to which context-specific independence is modeled using
frame-action hypergraphs. Benefits are that we naturally
maintain the distinction between two similar actions
performed by agents of different frames, and we add
fewer additional nodes: $|\hat{\Theta}| + |A|$. However, a hypergraph
is a more complex data structure to operate on. Temporal
AGGs (Jiang, Leyton-Brown, and Pfeffer 2009) extend
AGGs to a repeated game setting and allow decisions to condition
on chance nodes. These nodes may represent the action
counts from the previous step (similar to observing the actions
in the previous game). Temporal AGGs come closest to
multiagent influence diagrams (Koller and Milch 2001) although
they can additionally model the anonymity and independence
structure. Overall, I-POMDPs with frame-action
anonymity and context-specific independence significantly
augment the combination of Bayesian and temporal AGGs
by utilizing the structures in a partially observable stochastic
game setting with agent types.

Varakantham et al. (2014), building on previous
work (Varakantham et al. 2012), recently introduced a decentralized
MDP that models a simple form of anonymous
interactions: rewards and transition probabilities specific to
a state-action pair are affected by the number of other agents
regardless of their identities. The interaction influence is
not further detailed into which actions of other agents are
relevant (as in action anonymity) and thus configurations
and hypergraphs are not used. Furthermore, agent types
are not considered. Finally, the interaction hypergraphs in
networked-distributed POMDPs (Nair et al. 2005) model
complete reward independence between agents - analogous
to graphical games - which differs from the hypergraphs in
this paper (and action graphs) that model independence in
reward (and transition, observation probabilities) along a
different dimension: actions.


Background 

Interactive POMDPs allow a self-interested agent to plan
individually in a partially observable stochastic environment
in the presence of other agents of uncertain types. We
briefly review the I-POMDP framework and refer the reader
to (Gmytrasiewicz and Doshi 2005) for further details.

A finitely-nested I-POMDP for an agent (say
agent 0) of strategy level $l$ operating in a setting inhabited
by one or more other interacting agents is defined as the following
tuple:

$$\text{I-POMDP}_{0,l} = \langle IS_{0,l}, A, T_0, \Omega_0, O_0, R_0, OC_0 \rangle$$


• $IS_{0,l}$ denotes the set of interactive states defined as,
$IS_{0,l} = S \times \prod_{j=1}^{N} M_{j,l-1}$, where $M_{j,l-1} = \{\Theta_{j,l-1} \cup SM_j\}$,
for $l \geq 1$, and $IS_{i,0} = S$, where $S$ is the set
of physical states. $\Theta_{j,l-1}$ is the set of computable, intentional
models ascribed to agent $j$: $\theta_{j,l-1} = \langle b_{j,l-1}, \hat{\theta}_j \rangle$,
where $b_{j,l-1}$ is agent $j$'s level $l-1$ belief, $b_{j,l-1} \in \Delta(IS_{j,l-1})$,
and $\hat{\theta}_j = \langle A, T_j, \Omega_j, O_j, R_j, OC_j \rangle$ is $j$'s
frame. Here, $j$ is assumed to be Bayes-rational. At level 0,
$b_{j,0} \in \Delta(S)$ and a level-0 intentional model reduces to a
POMDP. $SM_j$ is the set of subintentional models of $j$; an
example is a finite state automaton;

• $A = A_0 \times A_1 \times \ldots \times A_N$ is the set of joint actions of all
agents;

• $T_0 : S \times A_0 \times \prod_{j=1}^{N} A_j \times S \rightarrow [0,1]$ is the transition
function;

• $\Omega_0$ is the set of agent 0's observations;

• $O_0 : S \times A_0 \times \prod_{j=1}^{N} A_j \times \Omega_0 \rightarrow [0,1]$ is the observation
function;

• $R_0 : S \times A_0 \times \prod_{j=1}^{N} A_j \rightarrow \mathbb{R}$ is the reward function; and

• $OC_0$ is the optimality criterion, which is identical to that
for POMDPs. In this paper, we consider a finite-horizon
optimality criterion.

Besides the physical state space, the I-POMDP's interactive
state space contains all possible models of other agents.
In its belief update, an agent has to update its belief about
the other agents' models based on an estimation about the
other agents' observations and how they update their models.
As the number of agents sharing the environment grows, the
size of the joint action and joint model spaces increases exponentially.
Therefore, the memory requirement for representing
the transition, observation and reward functions grows
exponentially, as does the complexity of performing the belief
update over the interactive states. In the context of N agents,
interactive bounded policy iteration (Sonu and Doshi 2014)
generates good quality solutions for an agent interacting
with 4 other agents (total of 5 agents) absent any problem
structure. To the best of our knowledge, this result illustrates
the best scalability so far to N > 2 agents.

Many-Agent I-POMDP 

To facilitate understanding and experimentation, we introduce
a pragmatic running example that also forms our evaluation
domain.

Example 1 (Policing Protest) Consider a policing scenario
where police (agent 0) must maintain order in 3 geographically
distributed and designated protest sites (labeled
0, 1, and 2) as shown in Fig. 1. A population of N agents
is protesting at these sites. Police may dispatch one or two
riot-control troops to either the same or different locations.
Protests with differing intensities, low, medium and high (disruptive),
occur at each of the three sites. The goal of the
police is to de-escalate protests to the low intensity at each
site. Protest intensity at any site is influenced by the number
of protestors and the number of police troops at that location.
In the absence of adequate policing, we presume that
the protest intensity escalates. On the other hand, two police
troops at a location are adequate for de-escalating protests.

Figure 1: Protestors of different frames (colors) and police troops
at two of three sites in the policing protest domain. The state space
of police decision making is factored into the protest intensity levels
at the sites.


Factored Beliefs and Update

As we mentioned previously, the subject agent in an
I-POMDP maintains a belief over the physical state and
joint models of other agents, $b_{0,l} \in \Delta(S \times \prod_{j=1}^{N} M_{j,l-1})$,
where $\Delta(\cdot)$ is the space of probability distributions. For
settings such as Example 1 where N is large, the size of
the interactive state space is exponentially larger,
$|IS_{0,l}| = |S||M_{j,l-1}|^N$, and the belief representation unwieldy. However,
the representation becomes manageable for large N if
the belief is factored:

$$
b_{0,l}(s, m_{1,l-1}, m_{2,l-1}, \ldots, m_{N,l-1}) = \Pr(s) \Pr(m_{1,l-1} \mid s) \times \Pr(m_{2,l-1} \mid s) \times \ldots \times \Pr(m_{N,l-1} \mid s) \tag{1}
$$

This factorization assumes conditional independence of
models of different agents given the physical state. Consequently,
beliefs that correlate agents may not be directly represented
although correlation could be alternately supported
by introducing models with a correlating device.

The memory consumed in storing a factored belief is
$O(|S| + N|S||M^*|)$, where $|M^*|$ is the size of the largest
model space among all other agents. This is linear in the
number of agents, which is much less than the exponentially
growing memory required to represent the belief as a joint
distribution over the interactive state space, $O(|S||M^*|^N)$.

Given agent 0's belief at time $t$, $b^t_{0,l}$, its action $a^t_0$ and the
subsequent observation it makes, $\omega^{t+1}_0$, the updated belief at
time step $t+1$, $b^{t+1}_{0,l}$, may be obtained as:

$$
\begin{aligned}
\Pr(s^{t+1}, m^{t+1}_{1,l-1}, \ldots, m^{t+1}_{N,l-1} \mid b^t_{0,l}, a^t_0, \omega^{t+1}_0) ={} & \Pr(s^{t+1} \mid b^t_{0,l}, a^t_0, \omega^{t+1}_0) \\
& \times \Pr(m^{t+1}_{1,l-1} \mid s^{t+1}, m^{t+1}_{2,l-1}, \ldots, m^{t+1}_{N,l-1}, b^t_{0,l}, a^t_0, \omega^{t+1}_0) \\
& \times \ldots \times \Pr(m^{t+1}_{N,l-1} \mid s^{t+1}, b^t_{0,l}, a^t_0, \omega^{t+1}_0)
\end{aligned}
\tag{2}
$$

Each factor in the product of Eq. 2 may be obtained as
follows. The update over the physical state is:


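$$
\begin{aligned}
\Pr(s^{t+1} \mid b^t_{0,l}, a^t_0, \omega^{t+1}_0) &\propto \Pr(s^{t+1}, \omega^{t+1}_0 \mid b^t_{0,l}, a^t_0) \\
&= \sum_{s^t} b^t_{0,l}(s^t) \sum_{m^t_{-0}} b^t_{0,l}(m^t_{1,l-1} \mid s^t) \times \ldots \times b^t_{0,l}(m^t_{N,l-1} \mid s^t) \\
&\quad \sum_{a^t_{-0}} \Pr(a^t_1 \mid m^t_{1,l-1}) \times \ldots \times \Pr(a^t_N \mid m^t_{N,l-1}) \\
&\quad \times O_0(s^{t+1}, \langle a^t_0, a^t_{-0} \rangle, \omega^{t+1}_0)\, T_0(s^t, \langle a^t_0, a^t_{-0} \rangle, s^{t+1})
\end{aligned}
\tag{3}
$$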

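and the update over the model of each other agent, $j = 1 \ldots N$,
conditioned on the state at $t+1$ is:

$$
\begin{aligned}
\Pr(m^{t+1}_{j,l-1} \mid s^{t+1}, m^{t+1}_{-j,l-1}, b^t_{0,l}, a^t_0, \omega^{t+1}_0) ={} & \sum_{s^t} b^t_{0,l}(s^t) \sum_{m^t_{-j,l-1}} b^t_{0,l}(m^t_{1,l-1} \mid s^t) \times \ldots \times b^t_{0,l}(m^t_{N,l-1} \mid s^t) \\
& \sum_{a^t_{-j}} \Pr(a^t_1 \mid m^t_{1,l-1}) \times \ldots \times \Pr(a^t_N \mid m^t_{N,l-1}) \\
& \sum_{\omega^{t+1}_j} O_j(s^{t+1}, \langle a^t_j, a^t_{-j} \rangle, \omega^{t+1}_j)\, \Pr(m^{t+1}_{j,l-1} \mid m^t_{j,l-1}, a^t_j, \omega^{t+1}_j)
\end{aligned}
\tag{4}
$$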


Derivations of Eqs. 3 and 4 are straightforward and not given
here due to lack of space. In particular, note that models of
agents other than $j$ at $t+1$ do not impact $j$'s model update in
the absence of correlated behavior. Thus, under the assumption
of a factored prior as in Eq. 1 and absence of agent correlations,
the I-POMDP belief update may be decomposed
into an update of the physical state and update of the models
of N agents conditioned on the state.
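To make the memory argument for the factored belief of Eq. 1 concrete, the following minimal Python sketch counts table entries for the joint and factored representations and evaluates a joint probability from its factors. The function names, the choice of 27 physical states (three sites with three intensity levels each) and the two models per agent are illustrative assumptions, not values taken from the experiments.

```python
def joint_belief_size(num_states: int, num_models: int, num_agents: int) -> int:
    """Entries in a joint distribution over S x M^N, i.e. |S||M*|^N."""
    return num_states * num_models ** num_agents


def factored_belief_size(num_states: int, num_models: int, num_agents: int) -> int:
    """Entries in Pr(s) plus N conditional tables Pr(m_j | s), i.e. |S| + N|S||M*|."""
    return num_states + num_agents * num_states * num_models


def joint_probability(pr_state, pr_model_given_state, state, models):
    """Evaluate Eq. 1: Pr(s) * prod_j Pr(m_j | s).

    pr_state maps state -> probability; pr_model_given_state[j][state] maps
    a model of agent j to its conditional probability (hypothetical layout).
    """
    p = pr_state[state]
    for j, m in enumerate(models):
        p *= pr_model_given_state[j][state][m]
    return p


# Hypothetical sizes: 27 physical states, 2 candidate models per agent, 10 agents.
joint_entries = joint_belief_size(27, 2, 10)
factored_entries = factored_belief_size(27, 2, 10)

# A two-state, two-agent toy belief.
pr_state = {0: 0.4, 1: 0.6}
table = {0: {"a": 0.5, "b": 0.5}, 1: {"a": 0.1, "b": 0.9}}
p = joint_probability(pr_state, [table, table], 1, ["b", "a"])
```

The factored count grows linearly in the number of agents, while the joint count doubles with every additional agent in this toy setting.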


Frame-Action Anonymity 

As noted by Jiang et al. (2011), many noncooperative and cooperative
problems exhibit the structure that rewards depend
on the number of agents acting in particular ways rather than
which agent is performing the act. This is particularly evident
in Example 1 where the outcome of policing largely depends
on the number of protestors that are peaceful and the
number that are disruptive. Building on this, we additionally
observe that the transient state of the protests and observations
of the police at a site are also largely influenced by the
number of peaceful and disruptive protestors moving from
one location to another. This is noted in the example below:

Example 2 (Frame-action anonymity of protestors) The
transient state of protests reflecting the intensity of protests
at each site depends on the previous intensity at a site and
the number of peaceful and disruptive protestors entering
the site. Police (noisily) observe the intensity of protest at
each site, which is again largely determined by the number
of peaceful and disruptive protestors at a site. Finally, the
outcome of policing at a site is contingent on whether the
protest is largely peaceful or disruptive. Consequently, the
identity of the individual protestors beyond their frame and
action is disregarded.

Here, peaceful and disruptive are different frames of others
in agent 0's I-POMDP, and the above definition may be extended
to any number of frames. Frame-action anonymity is
an important attribute of the above domain. We formally define
it in the context of agent 0's transition, observation and
reward functions next:

Definition 1 (Frame-action anonymity) Let $a^p_{-0}$ be a joint
action of all peaceful protestors and $a^d_{-0}$ be a joint action of
all disruptive ones. Let $\dot{a}^p_{-0}$ and $\dot{a}^d_{-0}$ be permutations of the
two joint action profiles, respectively. An I-POMDP models
frame-action anonymity iff for any $a_0$, $s$, $s'$, $a^p_{-0}$ and $a^d_{-0}$:
$T_0(s, a_0, a^p_{-0}, a^d_{-0}, s') = T_0(s, a_0, \dot{a}^p_{-0}, \dot{a}^d_{-0}, s')$,
$O_0(s', a_0, a^p_{-0}, a^d_{-0}, \omega_0) = O_0(s', a_0, \dot{a}^p_{-0}, \dot{a}^d_{-0}, \omega_0)$, and
$R_0(s, a_0, a^p_{-0}, a^d_{-0}) = R_0(s, a_0, \dot{a}^p_{-0}, \dot{a}^d_{-0})$ $\forall\, \dot{a}^p_{-0}, \dot{a}^d_{-0}$.

Recall the definition of an action configuration, $C$, as the
vector of action counts of an agent population. A permutation
of joint actions of others, say $\dot{a}^p_{-0}$, assigns different
actions to individual agents. Despite this, the fact that the
transition and observation probabilities, and the reward remain
unchanged indicates that the identity of the agent performing
the action is irrelevant. Importantly, the configuration
of the joint action and its permutation stays the same:
$C(a^p_{-0}) = C(\dot{a}^p_{-0})$. This combined with Def. 1 allows redefining
the transition, observation and reward functions
to be over configurations as: $T_0(s, a_0, C(a^p_{-0}), C(a^d_{-0}), s')$,
$O_0(s', a_0, C(a^p_{-0}), C(a^d_{-0}), \omega_0)$ and $R_0(s, a_0, C(a^p_{-0}), C(a^d_{-0}))$.

Let $A^p_1, \ldots, A^p_n$ be overlapping sets of actions of $n$ peaceful
protestors, and $A^p_{-0}$ be the Cartesian product of these
sets. Let $C(A^p_{-0})$ be the set of all action configurations for
$A^p_{-0}$. Observe that multiple joint actions from $A^p_{-0}$ may result
in a single configuration; these joint actions are configuration
equivalent. Consequently, the equivalence partitions
the joint action set $A^p_{-0}$ into $|C(A^p_{-0})|$ classes. Furthermore,
when other agents of the same frame have overlapping sets of actions,
the number of configurations could be much smaller
than the number of joint actions. Therefore, definitions of
the transition, observation and reward functions involving
configurations could be more compact.
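The partition of joint actions into configuration-equivalent classes can be illustrated with a small enumeration. The Python sketch below is only an illustration: the three action names and the population of four agents sharing one action set are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def configuration(joint_action, actions):
    """Map a joint action (one action per agent) to its vector of action counts."""
    counts = Counter(joint_action)
    return tuple(counts[a] for a in actions)


# Hypothetical shared action set of four peaceful protestors.
actions = ("stay", "go_site_1", "go_site_2")
num_agents = 4

joint_actions = list(product(actions, repeat=num_agents))
classes = {configuration(ja, actions) for ja in joint_actions}

# Two permutations of the same joint action fall into the same class.
c1 = configuration(("stay", "go_site_1", "stay", "go_site_2"), actions)
c2 = configuration(("go_site_2", "stay", "go_site_1", "stay"), actions)
```

In this toy setting the 81 joint actions collapse into 15 configurations, and the gap widens rapidly as the population grows.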

Frame-Action Hypergraphs 

In addition to frame-action anonymity, domains involving
agent populations often exhibit context-specific independences.
This is a broad category and includes the context-specific
independence found in conditional probability tables
of Bayesian networks (Boutilier et al. 1996) and in
action-graph games. It offers significant additional structure
for computational tractability. We begin by illustrating this
in the context of Example 1.

Example 3 (Context-specific independence in policing)
At a low intensity protest site, reward for the police on
passive policing is independent of the movement of the
protestors to other sites. The transient intensity of the
protest at a site given the level of policing at the site
(context) is independent of the movement of protestors
between other sites.

The context-specific independence above builds on the
similar independence in action graphs in two ways: (i) We
model such partial independence in the transitions of factored
states and in the observation function as well, in addition
to the reward function. (ii) We allow the context-specific
independence to be mediated by the frames of other
agents in addition to their actions. For example, the rewards
received from policing a site are independent of the number of
protestors at another site; instead, the rewards are influenced
by the number of peaceful and disruptive protestors present
at that site.

Figure 2: Levi (incidence) graph representation of a generic frame-action
hypergraph for (a) the transition function, and (b) the reward
function. The shaded nodes represent edges in the hypergraph.
Each edge has the context, $\psi$, denoted in bold, agent's action, $a$,
and its frame, $\hat{\theta}$, incident on it. For example, the reward for a state
and agent 0's action, $\langle s, a_0 \rangle$, is affected by others' actions $a_j$ and
$a'_j$ performed by any other agent of frame $\hat{\theta}_j$ only.

The latter difference generalizes the action graphs into
frame-action hypergraphs, and specifically 3-uniform hypergraphs
where each edge is a set of 3 nodes. We formally
define it below:

Definition 2 (Frame-action hypergraph) A frame-action
hypergraph for agent 0 is a 3-uniform hypergraph
$\mathcal{G} = \langle \Psi, A_{-0}, \hat{\Theta}_{-0}, E \rangle$, where $\Psi$ is a set of nodes that represent
the context; $A_{-0}$ is a set of action nodes with each node representing
an action that any other agent may take; $\hat{\Theta}_{-0}$ is a
set of frame nodes, each node representing a frame ascribed
to an agent; and $E$ is a set of 3-uniform hyperedges, each containing one
node from each of the sets $\Psi$, $A_{-0}$, and $\hat{\Theta}_{-0}$, respectively.

Both context and action nodes differ based on whether the
hypergraph applies to the transition, observation or reward
functions:

• For the transition function, the context is the set of all
pairs of states between which a transition may occur and
each action of agent 0, $\Psi = S \times A_0 \times S$, and the action
nodes include actions of all other agents,
$A_{-0} = \bigcup_{j=1}^{N} A_j$. Neighbors of a context node $\psi = \langle s, a_0, s' \rangle$ are
all the frame-action pairs that affect the probability of the
transition. An edge $(\langle s, a_0, s' \rangle, a_{-0}, \hat{\theta})$ indicates that the
probability of transitioning from $s$ to $s'$ on performing $a_0$
is affected (in part) by the other agents of frame $\hat{\theta}$ performing
the particular action in $A_{-0}$.

• The context for agent 0's observation function is the state-action-observation
triplet, $\Psi = S \times A_0 \times \Omega_0$, and the
action nodes are identical to those in the transition function.
Neighbors of a context node, $\langle s, a_0, \omega_0 \rangle$, are all those
frame-action pairs that affect the observation probability.
Specifically, an edge $(\langle s, a_0, \omega_0 \rangle, a_{-0}, \hat{\theta})$ indicates that
the probability of observing $\omega_0$ from state $s$ on performing
$a_0$ is affected (in part) by the other agents performing
action, $a_{-0}$, who possess frame $\hat{\theta}$.

• For agent 0's reward function, the context is the set of
pairs of state and action of agent 0, $\Psi = S \times A_0$, and the
action nodes are the same as those in the transition and observation
functions. An edge $(\langle s, a_0 \rangle, a_{-0}, \hat{\theta})$ in this hypergraph
indicates that the reward for agent 0 on performing
action $a_0$ at state $s$ is affected (in part) by the agents of
frame $\hat{\theta}$ who perform the action in $A_{-0}$.

We illustrate a general frame-action hypergraph for context-specific
independence in a transition function and a reward
function as bipartite Levi graphs in Figs. 2(a) and (b), respectively.
We point out that the hypergraph for the reward
function comes closest in semantics to the graph in action
graph games (Jiang, Leyton-Brown, and Bhat 2011) although
the former adds the state to the context and frames.
Hypergraphs for the transition and observation functions differ
substantially in semantics and form from action graphs.

To use these hypergraphs in our algorithms, we first define 
the general frame-action neighborhood of a context node. 


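Definition 3 (Frame-action neighborhood) The frame-action
neighborhood of a context node $\psi \in \Psi$, $\nu(\psi)$,
given a frame-action hypergraph $\mathcal{G}$ is defined as a subset
of $A_{-0} \times \hat{\Theta}_{-0}$ such that
$\nu(\psi) = \{(a_{-0}, \hat{\theta}) \mid a_{-0} \in A_{-0}, \hat{\theta} \in \hat{\Theta}_{-0}, (\psi, a_{-0}, \hat{\theta}) \in E\}$.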

As an example, the frame-action neighborhood of a state-action
pair, $\langle s, a_0 \rangle$, in a hypergraph for the reward function
is the set of all action and frame nodes incident on each hyperedge
anchored by the node $\langle s, a_0 \rangle$.
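Definitions 2 and 3 translate directly into a simple data structure. The following Python sketch stores a frame-action hypergraph as a set of (context, action, frame) triples and reads off the neighborhood of a context node; the state, action and frame labels are hypothetical and chosen only to echo the policing domain.

```python
from typing import Hashable, Set, Tuple

# A hyperedge joins one context node, one action node and one frame node.
Edge = Tuple[Hashable, str, str]


def frame_action_neighborhood(edges: Set[Edge], context: Hashable) -> Set[Tuple[str, str]]:
    """Return nu(psi): all (action, frame) pairs sharing a hyperedge with the context."""
    return {(action, frame) for (psi, action, frame) in edges if psi == context}


# Hypothetical reward hypergraph: contexts are (state, agent 0's action) pairs.
reward_edges = {
    (("site0_high", "police_site0"), "join_site0", "disruptive"),
    (("site0_high", "police_site0"), "join_site0", "peaceful"),
    (("site0_low", "police_site0"), "join_site0", "disruptive"),
}

nu = frame_action_neighborhood(reward_edges, ("site0_high", "police_site0"))
empty = frame_action_neighborhood(reward_edges, ("site1_low", "police_site1"))
```

A context with no incident hyperedge has an empty neighborhood, which encodes that no action of any other agent is relevant in that context.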

We move toward integrating frame-action anonymity introduced
in the previous subsection with the context-specific
independence as modeled above by introducing frame-action
configurations.


Definition 4 (Frame-action configuration) A configuration
over the frame-action neighborhood of a context node,
$\psi$, given a frame-action hypergraph is a vector,

$$
C^{\nu(\psi)} \triangleq \langle C(A^{\hat{\theta}_1}_{-0}), C(A^{\hat{\theta}_2}_{-0}), \ldots, C(A^{\hat{\theta}_{|\hat{\Theta}|}}_{-0}), C(\phi) \rangle
$$

where each $a$ included in $A^{\hat{\theta}}_{-0}$ is an action in $\nu(\psi)$ with
frame $\hat{\theta}$, and $C(A^{\hat{\theta}}_{-0})$ is a configuration over actions by
agents other than 0 whose frame is $\hat{\theta}$. All agents with frames
other than those in the frame-action neighborhood are assumed
to perform a dummy action, $\phi$.
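A frame-action configuration can be computed by counting only the pairs that lie in the neighborhood and lumping everything else under the dummy action. The Python sketch below assumes each other agent is summarized by a realized (action, frame) pair and, following the test in Algorithm 1, sends any pair outside the neighborhood to the dummy slot; the labels are hypothetical.

```python
def frame_action_configuration(agent_choices, neighborhood):
    """Count agents per (action, frame) pair in the neighborhood.

    The last component counts agents whose pair is outside the neighborhood,
    i.e. those treated as performing the dummy action phi.
    """
    keys = sorted(neighborhood)          # fixed ordering of the count vector
    counts = {key: 0 for key in keys}
    dummy = 0
    for action, frame in agent_choices:
        if (action, frame) in counts:
            counts[(action, frame)] += 1
        else:
            dummy += 1
    return tuple(counts[key] for key in keys) + (dummy,)


# Hypothetical neighborhood: only arrivals at site 0 matter in this context.
neighborhood = {("go_site_0", "disruptive"), ("go_site_0", "peaceful")}
choices = [
    ("go_site_0", "peaceful"),
    ("go_site_1", "peaceful"),
    ("go_site_0", "disruptive"),
    ("go_site_0", "peaceful"),
    ("go_site_2", "disruptive"),
]
config = frame_action_configuration(choices, neighborhood)
```

The resulting vector has one entry per neighborhood pair plus one for the dummy action, regardless of how many distinct actions the agents actually have.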

Definition 4 allows further inroads into compacting
the transition, observation and reward functions of the
I-POMDP using context-specific independence. Specifically,
we may redefine these functions one more time
to limit the configurations only over the frame-action
neighborhood of the context as, $T_0(s, a_0, C^{\nu(s,a_0,s')}, s')$,
$O_0(s', a_0, C^{\nu(s',a_0,\omega_0)}, \omega_0)$ and $R_0(s, a_0, C^{\nu(s,a_0)})$.²

2 Context in our transition function is $\langle s, a_0, s' \rangle$ compared with
the context of just $\langle s, a_0 \rangle$ in Varakantham et al.'s (2014) transitions.


Revised Framework

To benefit from structures of anonymity and context-specific
independence, we redefine the I-POMDP for agent 0 as:

$$\text{I-POMDP}_{0,l} = \langle IS_{0,l}, A, \Omega_0, \mathcal{T}_0, \mathcal{O}_0, \mathcal{R}_0, OC_0 \rangle$$

where:

• $IS_{0,l}$, $A$, $\Omega_0$, and $OC_0$ remain the same as before. The
physical states are factored as, $S = \prod_{k=1}^{K} X_k$.

• $\mathcal{T}_0$ is the transition function, $\mathcal{T}_0(x, a_0, C^{\nu(x,a_0,x')}, x')$
where $C^{\nu(x,a_0,x')}$ is the configuration over the frame-action
neighborhood of context $\langle x, a_0, x' \rangle$ obtained from
a hypergraph that holds for the transition function. This
transition function is significantly more compact than the
original that occupies space $O(|X|^2 |A_0| |A_{-0}|^N)$ compared
to the $O(|X|^2 |A_0| (\frac{N}{|\nu^*|})^{|\nu^*|})$ of $\mathcal{T}_0$, where the fraction
is the complexity of $\binom{N + |\nu^*|}{|\nu^*|}$, $|\nu^*|$ is the maximum
cardinality of the neighborhood of any context, and
$(\frac{N}{|\nu^*|})^{|\nu^*|} \ll |A_{-0}|^N$. The value $\binom{N + |\nu^*|}{|\nu^*|}$ is obtained from
combinatorial compositions and represents the number of
ways $|\nu^*| + 1$ non-negative values can be weakly composed
such that their sum is $N$.

• The redefined observation function is
$\mathcal{O}_0(x', a_0, C^{\nu(x',a_0,\omega_0)}, \omega_0)$ where $C^{\nu(x',a_0,\omega_0)}$ is the
configuration over the frame-action neighborhood of
context $\langle x', a_0, \omega_0 \rangle$ obtained from a hypergraph that
holds for the observation function. Analogously to the
transition function, the original observation function
consumes space $O(|X||\Omega||A_0||A_{-0}|^N)$, which is much
larger than space $O(|X||\Omega||A_0|(\frac{N}{|\nu^*|})^{|\nu^*|})$ occupied by
this redefinition.

• $\mathcal{R}_0$ is the reward function defined as $\mathcal{R}_0(x, a_0, C^{\nu(x,a_0)})$
where $C^{\nu(x,a_0)}$ is defined analogously to the configurations
in the previous parameters. The reward for a
state and actions may simply be the sum of rewards
for the state factors and actions (or a more general
function if needed). As with the transition and observation
functions, this reward function is compact, occupying
space $O(|X||A_0|(\frac{N}{|\nu^*|})^{|\nu^*|})$ that is much less than
$O(|X||A_0||A_{-0}|^N)$ of the original.
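The gap between the two representations can be checked numerically. The short Python sketch below counts joint actions of the other agents and weak compositions of N into |ν*| + 1 parts, and verifies the binomial count by brute force on a small case; the specific numbers of actions, agents and neighborhood sizes are illustrative assumptions.

```python
from itertools import product
from math import comb


def num_joint_actions(num_actions: int, num_agents: int) -> int:
    """Size of the joint action space of the other agents, |A_-0|^N."""
    return num_actions ** num_agents


def num_configurations(neighborhood_size: int, num_agents: int) -> int:
    """Weak compositions of N into |nu*| + 1 parts (the extra part is the dummy action)."""
    return comb(num_agents + neighborhood_size, neighborhood_size)


def brute_force_count(neighborhood_size: int, num_agents: int) -> int:
    """Enumerate count vectors of length |nu*| + 1 that sum to N (small cases only)."""
    slots = neighborhood_size + 1
    return sum(
        1
        for counts in product(range(num_agents + 1), repeat=slots)
        if sum(counts) == num_agents
    )


small_closed_form = num_configurations(2, 4)
small_enumerated = brute_force_count(2, 4)

# Hypothetical large case: 1,000 agents, 3 actions each, neighborhood of size 4.
large_configs = num_configurations(4, 1000)
large_joint = num_joint_actions(3, 1000)
```

The configuration count is polynomial in N for a fixed neighborhood size, whereas the joint action count is exponential in N.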

Belief Update For this extended I-POMDP, we compute
the updated belief over a physical state as a product of its
factors using Eq. 5 and the belief update over the models of each
other agent using Eq. 6 as shown below:


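$$
\begin{aligned}
\Pr(s^{t+1} \mid b^t_{0,l}, a^t_0, \omega^{t+1}_0) \propto{} & \sum_{s^t} b^t_{0,l}(s^t) \prod_{k=1}^{K} \sum_{C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)}} \Pr\big(C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)} \mid b^t_{0,l}(M_{1,l-1} \mid s^t), \ldots, b^t_{0,l}(M_{N,l-1} \mid s^t)\big) \\
& \mathcal{O}_0\big(x^{t+1}_k, a^t_0, C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)}, \omega^{t+1}_0\big) \\
& \times \sum_{s^t} b^t_{0,l}(s^t) \prod_{k=1}^{K} \sum_{C^{\nu(x^t_k, a^t_0, x^{t+1}_k)}} \Pr\big(C^{\nu(x^t_k, a^t_0, x^{t+1}_k)} \mid b^t_{0,l}(M_{1,l-1} \mid s^t), \ldots, b^t_{0,l}(M_{N,l-1} \mid s^t)\big) \\
& \mathcal{T}_0\big(x^t_k, a^t_0, C^{\nu(x^t_k, a^t_0, x^{t+1}_k)}, x^{t+1}_k\big)
\end{aligned}
\tag{5}
$$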


Here, the term, $\Pr(C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)} \mid b^t_{0,l}(M_{1,l-1} \mid s^t), \ldots, b^t_{0,l}(M_{N,l-1} \mid s^t))$,
is the probability of a frame-action configuration
(see Def. 4) that is context specific to the triplet,
$\langle x^{t+1}_k, a^t_0, \omega^{t+1}_0 \rangle$. It is computed from the factored beliefs
over the models of all others. We discuss this computation
in the next section. The second configuration term has an
analogous meaning and is computed similarly.

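The factored belief update over the models of each other
agent, $j = 1 \ldots N$, conditioned on the state at $t+1$ becomes:

$$
\begin{aligned}
\Pr(m^{t+1}_{j,l-1} \mid s^{t+1}, m^t_{j,l-1}, b^t_{0,l}, a^t_0) ={} & \sum_{s^t} b^t_{0,l}(s^t) \sum_{m^t_j} b^t_{0,l}(m^t_{j,l-1} \mid s^t) \sum_{a^t_j} \Pr(a^t_j \mid m^t_{j,l-1}) \\
& \sum_{C^{\nu(x^{t+1}, a^t_j, \omega^{t+1}_j)}} \Pr\big(C^{\nu(x^{t+1}, a^t_j, \omega^{t+1}_j)} \mid b^t_{0,l}(M_{1,l-1} \mid s^t), \ldots, b^t_{0,l}(M_{N,l-1} \mid s^t)\big) \\
& \sum_{\omega^{t+1}_j} O_j\big(x^{t+1}, a^t_j, C^{\nu(x^{t+1}, a^t_j, \omega^{t+1}_j)}, \omega^{t+1}_j\big)\, \Pr(m^{t+1}_{j,l-1} \mid m^t_{j,l-1}, a^t_j, \omega^{t+1}_j)
\end{aligned}
\tag{6}
$$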


Proofs for obtaining Eqs. 5 and 6 are omitted due to space
restrictions. Notice that the distributions over configurations
are computed using distributions over other agents' models.
Therefore, we must maintain and update conditional beliefs
over other agents' models. Hence, the problem cannot be reduced
to a POMDP by including configurations with physical
states.


Value Function The finite-horizon value function of the
many-agent I-POMDP continues to be the sum of agent 0's
immediate reward and the discounted expected reward over
the future:

$$
V^h(b^t_{0,l}) = \max_{a^t_0 \in A_0} \Big\{ ER_0(b^t_{0,l}, a^t_0) + \gamma \sum_{\omega^{t+1}_0} \Pr(\omega^{t+1}_0 \mid b^t_{0,l}, a^t_0)\, V^{h-1}(b^{t+1}_{0,l}) \Big\} \tag{7}
$$

where $ER_0(b^t_{0,l}, a^t_0)$ is the expected immediate reward of
agent 0 and $\gamma$ is the discount factor. In the context of the redefined
reward function of the many-agent I-POMDP framework
in this section, the expected immediate reward is obtained
as:


$$
ER_0(b^t_{0,l}, a^t_0) = \sum_{s^t} b^t_{0,l}(s^t) \sum_{k=1}^{K} \sum_{C^{\nu(x^t_k, a^t_0)}} \Pr\big(C^{\nu(x^t_k, a^t_0)} \mid b^t_{0,l}(M_{1,l-1} \mid s^t), \ldots, b^t_{0,l}(M_{N,l-1} \mid s^t)\big)\, \mathcal{R}_0\big(x^t_k, a^t_0, C^{\nu(x^t_k, a^t_0)}\big) \tag{8}
$$

where the outermost sum is over all the state factors,
$s^t = \langle x^t_1, \ldots, x^t_K \rangle$, and the term,
$\Pr(C^{\nu(x^t_k, a^t_0)} \mid b^t_{0,l}(M_{1,l-1} \mid s^t), \ldots, b^t_{0,l}(M_{N,l-1} \mid s^t))$, denotes
the probability of a frame-action configuration that
is context-specific to the factor, $x^t_k$. Importantly, Proposition 1
establishes that the Bellman equation above is
exact. The proof is given in the extended version of this
paper (Sonu, Chen, and Doshi 2015).


Proposition 1 (Optimality) The dynamic programming in
Eq. 7 provides an exact computation of the value function
for the many-agent I-POMDP.
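As a rough illustration of how the expected immediate reward of Eq. 8 might be evaluated, the Python sketch below sums, for each state and each state factor, the factor reward weighted by the probability of each frame-action configuration. It assumes the reward is additive over state factors, as the reward bullet above allows; the callables config_dist and reward and the toy data are hypothetical placeholders rather than the paper's implementation.

```python
def expected_reward(belief, a0, config_dist, reward):
    """Expected immediate reward under an additive, factored reward.

    belief: dict mapping a state (tuple of factor values) to its probability.
    config_dist(state, k, a0): dict mapping a configuration to its probability
        for the context formed by factor k of the state and action a0.
    reward(k, x_k, a0, config): reward contribution of factor k.
    """
    total = 0.0
    for state, p_state in belief.items():
        for k, x_k in enumerate(state):
            for config, p_config in config_dist(state, k, a0).items():
                total += p_state * p_config * reward(k, x_k, a0, config)
    return total


# Toy data: two sites; one disruptive arrival is equally likely or not.
belief = {("low", "high"): 1.0}


def toy_config_dist(state, k, a0):
    return {(0, 1): 0.5, (1, 0): 0.5}   # (arrivals counted, dummy count)


def toy_reward(k, x_k, a0, config):
    return -float(config[0]) if x_k == "high" else 0.0


value = expected_reward(belief, "police_site1", toy_config_dist, toy_reward)
```

In practice config_dist would be supplied by the configuration-distribution routine described in the next section, evaluated on the conditional beliefs over models given the state.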




Algorithms 

We present an algorithm that computes the distribution over
frame-action configurations and outline our simple method
for solving the many-agent I-POMDP defined previously.

Distribution Over Frame-Action Configurations 

Algorithm 1 generalizes an algorithm by Jiang and
Leyton-Brown (2011) for computing configurations
over actions given mixed strategies of other agents to
include frames and conditional beliefs over models of
other agents. It computes the probability distribution
of configurations over the frame-action neighborhood
of an action given the belief over the agents’ models:
$\Pr(C^{\nu(x, a_0, \omega_0)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t))$ and
$\Pr(C^{\nu(x, a_0, x')} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t))$ in Eq. 5,
$\Pr(C^{\nu(x, a_j, \omega_j)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t))$ in Eq. 6, and
$\Pr(C^{\nu(x, a_0)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t))$ in Eq. 8.


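Algorithm 1 Computing $\Pr(C^{\nu(\cdot)} \mid b_{0,l}(M_{1,l-1}|s), \ldots, b_{0,l}(M_{N,l-1}|s))$

Input: $\langle b_{0,l}(M_{1,l-1}|s), \ldots, b_{0,l}(M_{N,l-1}|s) \rangle$

Output: A trie $\mathcal{P}_N$ representing the distribution over the
frame-action configurations over $\nu(\cdot)$

 1: Initialize $c_0 \leftarrow (0, \ldots, 0)$, one value for each frame-action
    pair in $\nu(\cdot)$ and for $\phi$. Insert into empty trie $\mathcal{P}_0$
 2: Initialize $\mathcal{P}_0[c_0] \leftarrow 1$
 3: for $j \leftarrow 1$ to $N$ do
 4:     Initialize $\mathcal{P}_j$ to be an empty trie
 5:     for all $c_{j-1}$ from $\mathcal{P}_{j-1}$ do
 6:         for all $m_{j,l-1} \in M_{j,l-1}$ do
 7:             for all $a_j \in A_j$ such that $\Pr(a_j \mid m_{j,l-1}) > 0$ do
 8:                 $c_j \leftarrow c_{j-1}$
 9:                 if $\langle a_j, \hat{\theta}_j \rangle \in \nu(\cdot)$ then
10:                     $c_j[\langle a_j, \hat{\theta}_j \rangle] \leftarrow c_j[\langle a_j, \hat{\theta}_j \rangle] + 1$
11:                 else
12:                     $c_j[\phi] \leftarrow c_j[\phi] + 1$
13:                 if $\mathcal{P}_j[c_j]$ does not exist then
14:                     Initialize $\mathcal{P}_j[c_j] \leftarrow 0$
15:                 $\mathcal{P}_j[c_j] \leftarrow \mathcal{P}_j[c_j] + \mathcal{P}_{j-1}[c_{j-1}] \times \Pr(a_j \mid m_{j,l-1}) \times b_{0,l}(m_{j,l-1}|s)$
16: return $\mathcal{P}_N$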


Algorithm 1 adds the actions of each agent one at a time.
A trie data structure enables efficient insertion and access
of the configurations. We begin by initializing the
configuration space for 0 agents ($\mathcal{P}_0$) to contain one tuple of
integers ($c_0$) with $|\nu| + 1$ zeros and assign its probability to
be 1 (lines 1-2). Using the configurations of the previous step, we
construct the configurations over the actions performed by $j$
agents by adding 1 to a relevant element depending on $j$’s
action and frame (lines 3-15). If an action $a_j$ performed by
$j$ with frame $\hat{\theta}_j$ is in the frame-action neighborhood $\nu(\cdot)$,
then we increment its corresponding count by 1. Otherwise,
it is considered a dummy action and the count of $\phi$ is
incremented (lines 9-12). Similarly, we update the probability
of a configuration using the probability of $a_j$, the belief in
the model that prescribes it, and the probability of the
base configuration $c_{j-1}$ (line 15). This algorithm is invoked
multiple times for different values of $\nu(\cdot)$ as needed in the
belief update and value function computation.
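
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it swaps the trie for a plain dictionary keyed by count tuples, and the input layout (each agent as a list of `(frame, model_probability, action_distribution)` triples) and the function name are assumptions made for the example. Comments point back to the numbered lines of the pseudocode.

```python
from collections import defaultdict


def configuration_distribution(agents, neighborhood):
    """Distribution over frame-action configurations for one state s.

    agents: list with one entry per other agent; each entry is a list of
        (frame, model_probability, {action: probability}) triples, i.e. the
        belief over that agent's models and the action each model prescribes.
    neighborhood: list of (action, frame) pairs that make up nu(.).
    Returns a dict mapping a count tuple (one slot per pair in nu(.), plus a
    final slot for the dummy action phi) to its probability.
    """
    index = {pair: i for i, pair in enumerate(neighborhood)}
    phi = len(neighborhood)                    # last slot counts dummy actions
    dist = {(0,) * (phi + 1): 1.0}             # lines 1-2: P_0[c_0] = 1
    for models in agents:                      # line 3: add one agent at a time
        nxt = defaultdict(float)               # lines 4, 13-14: missing keys start at 0
        for config, p_config in dist.items():              # line 5
            for frame, p_model, policy in models:           # line 6
                for action, p_action in policy.items():     # line 7
                    if p_action <= 0.0:
                        continue
                    slot = index.get((action, frame), phi)  # lines 9-12
                    new_config = list(config)               # line 8
                    new_config[slot] += 1
                    nxt[tuple(new_config)] += p_config * p_action * p_model  # line 15
        dist = dict(nxt)
    return dist                                # line 16
```

For three agents that each hold a single model with frame "f" and choose "protest" with probability 0.5, and a neighborhood containing only `("protest", "f")`, the result has four configurations whose probabilities follow a binomial distribution over the number of protesting agents.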

We utilize a simple method for solving the many-agent
I-POMDP given an initial belief: each other agent is modeled
using a finite-state controller as part of the interactive state
space. A reachability tree with beliefs as nodes is projected for
as many steps as the horizon (using Eqs. 5 and 6) and value
iteration (Eq. 7) is performed on the tree. In order to mitigate
the curse of history due to the branching factor that equals
the number of agent 0’s actions and observations, we utilize
the well-known technique of sampling observations from the
propagated belief and obtain a sampled tree on which value
iteration is run to get a policy. For any observation
that does not appear in the sample, the agent takes the action
that maximizes the immediate expected reward.
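
The sampled reachability tree described above can be sketched as a short recursion. This is a generic illustration under stated assumptions, not the solver used in the paper: the callables `reward`, `obs_dist` and `update` are hypothetical stand-ins for the expected immediate reward, the observation distribution given a belief and action, and the belief update, and the default discount and sample count are arbitrary.

```python
import random


def sampled_value(belief, horizon, actions, reward, obs_dist, update,
                  gamma=0.9, n_samples=5, rng=None):
    """Value and best action at `belief` using sampled observations.

    reward(belief, action)        -> expected immediate reward
    obs_dist(belief, action)      -> {observation: probability}
    update(belief, action, obs)   -> next belief
    The sum over all observations in the Bellman backup is replaced by an
    average over n_samples observations drawn from obs_dist.
    """
    rng = rng or random.Random(0)
    best_value, best_action = float("-inf"), None
    for action in actions:
        value = reward(belief, action)
        if horizon > 1:
            dist = obs_dist(belief, action)
            observations = list(dist)
            weights = [dist[o] for o in observations]
            sampled = rng.choices(observations, weights=weights, k=n_samples)
            future = 0.0
            for obs in sampled:
                child_value, _ = sampled_value(
                    update(belief, action, obs), horizon - 1, actions,
                    reward, obs_dist, update, gamma, n_samples, rng)
                future += child_value
            value += gamma * future / n_samples
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action
```

Each node now branches on `n_samples` observations instead of all of them, which bounds the tree size independently of the observation space at the cost of a Monte Carlo estimate of the future value.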

Computational Savings 

The complexity of accessing an element in a ternary search
trie is $O(|\nu|)$. The maximum number of configurations
encountered at any iteration is upper bounded by the total
number of configurations for $N$ agents, i.e., $O((N/|\nu|)^{|\nu|})$.
The complexity of Algorithm 1 is polynomial in $N$,
$O(N|M^*||A^*||\nu^*|(N/|\nu^*|)^{|\nu^*|})$, where $M^*$ and $A^*$ are the largest
sets of models and actions for any agent.
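
A small numeric check helps to see why counting configurations is so much cheaper than enumerating joint actions. The sketch below relies on the standard stars-and-bars count: distributing $N$ agents over $|\nu|$ frame-action slots plus the dummy slot $\phi$ gives $\binom{N+|\nu|}{|\nu|}$ configurations. The function names are illustrative, and the brute-force enumerator is only there to confirm the closed form for small inputs.

```python
from itertools import product
from math import comb


def count_configurations(n_agents, nu_size):
    """Number of count vectors over nu_size slots plus the dummy slot phi."""
    return comb(n_agents + nu_size, nu_size)


def enumerate_configurations(n_agents, nu_size):
    """Brute-force the same set by assigning each agent to one slot."""
    configs = set()
    for assignment in product(range(nu_size + 1), repeat=n_agents):
        counts = [0] * (nu_size + 1)
        for slot in assignment:
            counts[slot] += 1
        configs.add(tuple(counts))
    return configs
```

With 5 agents, 4 actions each, and a neighborhood of size 2, there are `4 ** 5 = 1024` joint actions but only `count_configurations(5, 2) = 21` configurations; with 1,000 agents the configuration count is 501,501, while the number of joint actions is astronomically large.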

For the traditional I-POMDP belief update, the complexity
of computing Eq. 3 is $O(|S||M^*|^N|A^*|^N)$ and that
for computing Eq. 4 is $O(|S||M^*|^N|A^*|^N|\Omega^*|)$, where *
denotes the maximum cardinality of a set for any agent.
For a factored representation, the belief update operator
invokes Eq. 3 for each value of all state factors and it
invokes Eq. 4 for each model of each agent $j$ and for all
values of updated states. Hence the total complexity of
belief update is $O(N|M^*||S|^2|M^*|^N|A^*|^N|\Omega^*|)$. The
complexity of computing the updated belief over state factor $x^{t+1}$
using Eq. 5 is $O(|S|NK|M^*||A^*||\nu^*|(N/|\nu^*|)^{|\nu^*|})$ (recall
the complexity of Algorithm 1). Similarly, the complexity
of computing the updated model probability using Eq. 6 is
$O((|S|N|M^*||A^*||\nu^*| + |\Omega^*|)(N/|\nu^*|)^{|\nu^*|})$. These
complexity terms are polynomial in $N$ for small values of $|\nu^*|$ as
opposed to exponential in $N$ as in Eqs. 3 and 4. The overall
complexity of belief update is also polynomial in $N$.

The complexity of computing the immediate expected
reward in the absence of problem structure is
$O(|S|K|M^*|^N|A^*|^N)$. On the other hand, the
complexity of computing the expected reward using Eq. 8 is
$O(|S|KN|M^*||A^*||\nu^*|(N/|\nu^*|)^{|\nu^*|})$, which is again
polynomial in $N$ for low values of $|\nu^*|$. These complexities are
discussed in greater detail in (Sonu, Chen, and Doshi 2015).

Experiments 

We implemented a simple and systematic I-POMDP solving
technique that computes reachable beliefs over the finite
horizon and then calculates the optimal value at the
root node using the Bellman equation for the Many-Agent
I-POMDP framework. We evaluate its performance in the
aforementioned non-cooperative policing protest scenario
($|S| = 27$, $|A_0| = 9$, $|A_j| = 4$, $|\Omega_0| = 8$, $|\Omega_j| = 8$). We
model the other agents as POMDPs and solve them using
bounded policy iteration (Poupart and Boutilier 2003),
representing the models as finite state controllers. This
representation enables us to have a compact model space. We set
the maximum planning horizon to 4 throughout the
experiments. The frame-action hypergraphs are encoded into the
transition, observation and reward functions of the
Many-Agent I-POMDP (Fig. 3). All computations are carried out
on a RHEL platform with a 2.80 GHz processor and 4 GB of
memory.



Figure 3: A compact Levi graph representation of policing protest
as a frame-action hypergraph for (a) the transition function, and (b)
the reward function at site 0. Variables x and x' represent the start
and end intensities of the protest at site 0 and the action shows the
location of the two police troops. As two police troops are sufficient
to de-escalate any protest, the contexts in which both troops are
at site 0 are independent of the actions of other agents. All other
contexts depend on the agents choosing to protest at site 0 only.


To evaluate the computational gain obtained by exploiting
problem structures, we implemented a solution algorithm
similar to the one described earlier that does not exploit
any problem structure. A comparison of the Many-Agent
I-POMDP with the original I-POMDP yields two important
results: (i) When there are few other agents, the
Many-Agent I-POMDP provides exactly the same solution as the
original I-POMDP but with reduced running times by
exploiting the problem structure. (ii) The Many-Agent I-POMDP
scales to larger agent populations, from 100 to 1,000+, and
the new framework delivers promising results within
reasonable time.


Protestors   H   I-POMDP   Many-Agent   Exp. Value
2            2   1 s       0.55 s       77.42
2            3   19 s      17 s         222.42
3            2   3 s       0.56 s       77.34
3            3   38 s      17 s         222.32
4            2   39 s      0.57 s       76.96
4            3   223 s     17 s         221.87
5            2   603 s     0.60 s       76.88
5            3   2,480 s   18 s         221.77

Table 1: Comparison between the traditional I-POMDP and the
Many-Agent I-POMDP, both following the same solution approach of
computing a reachability tree and performing backup. H is the
planning horizon.


In the first setting, we consider up to 5 protestors with
different frames. As shown in Table 1, both the traditional and
the Many-Agent I-POMDP produce policies with the same
expected value. However, as the Many-Agent I-POMDP
losslessly projects joint actions to configurations, it requires
much less running time.

[Figure 4 plots; horizontal axis: Number of Agents]

Figure 4: (a) The computational gain obtained by observation
sampling. (b) Performance of Many-Agent I-POMDP with
observation sampling for horizon 3 and 4. The time required to solve a
problem is polynomial in the number of agents.


Our second setting considers a large number of protestors,
for which the traditional I-POMDP does not scale. Instead,
we first scale up the exact solution method using the
Many-Agent I-POMDP to deal with a few hundred other agents.
Although the exploitation of the problem structures reduces
the curse of dimensionality that plagues I-POMDPs, the
curse of history is unaffected by such approaches. To
mitigate the curse of history we use the well-known observation
sampling method (Doshi and Gmytrasiewicz 2009), which
allows us to scale to over 1,000 agents in a reasonable time
of 4.5 hours as we show in Fig. 4(a). This increases to about
7 hours if we extend the horizon to 4 as shown in Fig. 4(b).


Conclusion 

The key contribution of the Many-Agent I-POMDP is its
scalability beyond 1,000 agents by exploiting problem
structures. We formalize widely existing problem structures -
frame-action anonymity and context-specific independence -
and encode them as frame-action hypergraphs. Other real-world
examples exhibiting such problem structure are found in
economics, where the value of an asset depends on the number
of agents vying to acquire it and their financial standing
(frame), and in real estate, where the value of a property depends
on its demand, the valuations of neighboring properties, and
the economic status of the neighbors because an upscale
neighborhood is desirable. Compared to the previous
best approach (Sonu and Doshi 2014), which scales to an
extension of the simple tiger problem involving 5 agents only,
the presented framework is far more scalable in terms of
number of agents. Our future work includes exploring other
types of problem structures and developing approximation
algorithms for this I-POMDP. An integration with existing
multiagent simulation platforms to illustrate the behavior of
agent populations may be interesting.

Appendix

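Factored Belief Update:

$$\begin{aligned}
b^{t+1}_{0,l}(s^{t+1}, m^{t+1}_1, \ldots, m^{t+1}_N)
&= \Pr(s^{t+1} \mid b^t_{0,l}, a^t_0, \omega^{t+1}_0) \times \Pr(m^{t+1}_1 \mid s^{t+1}, b^t_{0,l}, a^t_0, \omega^{t+1}_0) \\
&\quad \times \cdots \times \Pr(m^{t+1}_N \mid s^{t+1}, m^{t+1}_1, \ldots, m^{t+1}_{N-1}, b^t_{0,l}, a^t_0, \omega^{t+1}_0)
\end{aligned}$$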




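Derivation of equation 5:

Starting with equation 3, we have:

$$\begin{aligned}
\Pr(s^{t+1} \mid b^t_{0,l}, a^t_0, \omega^{t+1}_0) &\propto \Pr(s^{t+1}, \omega^{t+1}_0 \mid b^t_{0,l}, a^t_0) \\
&= \Pr(s^{t+1} \mid b^t_{0,l}, a^t_0)\, \Pr(\omega^{t+1}_0 \mid s^{t+1}, b^t_{0,l}, a^t_0)
\end{aligned}$$

The update term for physical states may be represented as
a product of its factors such that for any factor $X_k$:

$$\Pr(x^{t+1}_k \mid b^t_{0,l}, a^t_0) = \sum_{s^t} b^t_{0,l}(s^t) \sum_{m^t_{-0}} b^t_{0,l}(m^t_{-0}|s^t) \sum_{a^t_{-0}} \Pr(a^t_{-0} \mid m^t_{-0})\, T_0(x^t_k, \langle a^t_0, a^t_{-0} \rangle, x^{t+1}_k)$$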

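The cumulative probability of joint actions mapping to the
same configuration,
$\sum_{m^t_{-0}} b^t_{0,l}(m^t_{-0}|s^t) \sum_{a^t_{-0} \in A^c_{-0}} \Pr(a^t_{-0} \mid m^t_{-0})$,
is computed tractably using Algorithm 1 as
$\Pr(C^{\nu(x^t_k, a^t_0, x^{t+1}_k)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t))$.

Hence, the equation becomes:

$$\Pr(x^{t+1}_k \mid b^t_{0,l}, a^t_0) = \sum_{s^t} b^t_{0,l}(s^t) \sum_{C^{\nu(x^t_k, a^t_0, x^{t+1}_k)}} \Pr\big(C^{\nu(x^t_k, a^t_0, x^{t+1}_k)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t)\big) \times T_0\big(x^t_k, a^t_0, C^{\nu(x^t_k, a^t_0, x^{t+1}_k)}, x^{t+1}_k\big)$$

where $b^t_{0,l}(m^t_{-0}|s^t) = b^t_{0,l}(m^t_1|s^t) \times \cdots \times b^t_{0,l}(m^t_N|s^t)$ and
$\Pr(a^t_{-0} \mid m^t_{-0}) = \Pr(a^t_1 \mid m^t_1) \times \cdots \times \Pr(a^t_N \mid m^t_N)$.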

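We introduce a projection function $\delta^{\nu(\psi)}$ that maps joint
actions to the corresponding frame-action configurations as
defined in Definition 4. Formally, $\delta^{\nu(\psi)} : A_{-0} \rightarrow \mathcal{C}^{\nu(\psi)}$, where
$\mathcal{C}^{\nu(\psi)}$ is the set of all possible configurations such that for
all agents $j$ with frame $\hat{\theta}$,
$C(a, \hat{\theta}) = |\{ j : a_j = a, \hat{\theta}_j = \hat{\theta}, \langle a_j, \hat{\theta} \rangle \in \nu(\psi) \}|$.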
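
The projection function can be illustrated with a short sketch. The names `project` and `partition_joint_actions`, and the representation of a neighborhood as a list of `(action, frame)` pairs, are assumptions made for the example. The first function counts, for each pair in the neighborhood, how many agents with that frame chose that action; the second groups all joint actions by their image under the projection, so that every group maps to a single configuration.

```python
from collections import defaultdict
from itertools import product


def project(joint_action, frames, neighborhood):
    """Map a joint action to its frame-action configuration.

    joint_action: tuple with one action per other agent.
    frames: the frame of each agent, in the same order.
    neighborhood: list of (action, frame) pairs in nu(psi).
    """
    counts = [0] * len(neighborhood)
    for action, frame in zip(joint_action, frames):
        if (action, frame) in neighborhood:
            counts[neighborhood.index((action, frame))] += 1
    return tuple(counts)


def partition_joint_actions(action_sets, frames, neighborhood):
    """Group every joint action by the configuration it projects to."""
    blocks = defaultdict(list)
    for joint in product(*action_sets):
        blocks[project(joint, frames, neighborhood)].append(joint)
    return dict(blocks)
```

For three agents with frames `["f", "f", "g"]`, two actions each, and the neighborhood `[("protest", "f")]`, the eight joint actions fall into three groups, one per count of frame-"f" agents that protest; the agent with frame "g" never affects the configuration.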

Next we partition the set of joint actions of all other agents
$A_{-0}$ into smaller subsets $A^1_{-0}, \ldots, A^{|\mathcal{C}^{\nu(\psi)}|}_{-0}$ such that the
projection function $\delta^{\nu(\psi)}$ maps all joint actions belonging
to any given partition $A^c_{-0}$ to the same configuration.
Hence, we may rewrite the above equation as:

$$\Pr(x^{t+1}_k \mid b^t_{0,l}, a^t_0) = \sum_{s^t} b^t_{0,l}(s^t) \sum_{c=1}^{|\mathcal{C}^{\nu(x^t_k, a^t_0, x^{t+1}_k)}|} \sum_{a^t_{-0} \in A^c_{-0}} \sum_{m^t_{-0}} b^t_{0,l}(m^t_{-0}|s^t) \times \Pr(a^t_{-0} \mid m^t_{-0})\, T_0(x^t_k, \langle a^t_0, a^t_{-0} \rangle, x^{t+1}_k)$$


Under frame-action anonymity and frame-action
independence, for all joint actions $a^t_{-0} \in A^c_{-0}$,

$$T_0(x^t_k, \langle a^t_0, a^t_{-0} \rangle, x^{t+1}_k) = T_0\big(x^t_k, a^t_0, C^{\nu(x^t_k, a^t_0, x^{t+1}_k)}, x^{t+1}_k\big)$$

where $C^{\nu(x^t_k, a^t_0, x^{t+1}_k)} = \delta^{\nu(x^t_k, a^t_0, x^{t+1}_k)}(a^t_{-0})$.


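Similarly, the observation probability may also be
obtained in a factored form and
$C^{\nu(s^{t+1}, a^t_0, \omega^{t+1}_0)} = \delta^{\nu(s^{t+1}, a^t_0, \omega^{t+1}_0)}(a^t_{-0})$
may be substituted instead of the joint action.

$$\Pr(\omega^{t+1}_0 \mid s^{t+1}, b^t_{0,l}, a^t_0) = \sum_{s^t} b^t_{0,l}(s^t) \prod_{k=1}^{K} \sum_{C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)}} \Pr\big(C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t)\big)\, O_0\big(x^{t+1}_k, a^t_0, C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)}, \omega^{t+1}_0\big)$$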

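Therefore, we may rewrite equation 3 as follows:

$$\begin{aligned}
\Pr(s^{t+1} \mid b^t_{0,l}, a^t_0, \omega^{t+1}_0) \propto{}
&\Big[ \sum_{s^t} b^t_{0,l}(s^t) \prod_{k=1}^{K} \sum_{C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)}} \Pr\big(C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t)\big) \\
&\qquad O_0\big(x^{t+1}_k, a^t_0, C^{\nu(x^{t+1}_k, a^t_0, \omega^{t+1}_0)}, \omega^{t+1}_0\big) \Big] \\
\times{} &\Big[ \sum_{s^t} b^t_{0,l}(s^t) \prod_{k=1}^{K} \sum_{C^{\nu(x^t_k, a^t_0, x^{t+1}_k)}} \Pr\big(C^{\nu(x^t_k, a^t_0, x^{t+1}_k)} \mid b^t_{0,l}(M_{1,l-1}|s^t), \ldots, b^t_{0,l}(M_{N,l-1}|s^t)\big) \\
&\qquad T_0\big(x^t_k, a^t_0, C^{\nu(x^t_k, a^t_0, x^{t+1}_k)}, x^{t+1}_k\big) \Big]
\end{aligned} \qquad (9)$$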


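Derivation of equation 6:

The belief update over the models of agent $j$ shown in
equation 4 can be rewritten as follows:

$$\begin{aligned}
&\Pr(m^{t+1}_{j,l-1} \mid s^{t+1}, m^{t+1}_{1,l-1}, \ldots, m^{t+1}_{j-1,l-1}, b^t_{0,l}, a^t_0, \omega^{t+1}_0) \\
&= \sum_{s^t} b^t_{0,l}(s^t) \sum_{m^t_{1,l-1}} b^t_{0,l}(m^t_{1,l-1}|s^t) \sum_{a^t_1} \Pr(a^t_1 \mid m^t_{1,l-1}) \cdots \sum_{m^t_{N,l-1}} b^t_{0,l}(m^t_{N,l-1}|s^t) \sum_{a^t_N} \Pr(a^t_N \mid m^t_{N,l-1}) \\
&\qquad \sum_{\omega^{t+1}_j} O_j(s^{t+1}, \langle a^t_j, a^t_{-j} \rangle, \omega^{t+1}_j)\, \Pr(m^{t+1}_{j,l-1} \mid m^t_{j,l-1}, a^t_j, \omega^{t+1}_j)
\end{aligned}$$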

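Substituting the frame-action configuration as in equation 9,
we get:

$$\begin{aligned}
&\Pr(m^{t+1}_{j,l-1} \mid s^{t+1}, m^{t+1}_{1,l-1}, \ldots, m^{t+1}_{j-1,l-1}, b^t_{0,l}, a^t_0, \omega^{t+1}_0) \\
&= \sum_{s^t} b^t_{0,l}(s^t) \sum_{m^t_j} b^t_{0,l}(m^t_j|s^t) \sum_{a^t_j} \Pr(a^t_j \mid m^t_j) \sum_{C^{\nu(x^{t+1}, a^t_j, \omega^{t+1}_j)}} \Pr\big(C^{\nu(x^{t+1}, a^t_j, \omega^{t+1}_j)} \mid b^t_{0,l}(M_1|s^t), \ldots, b^t_{0,l}(M_{j-1}|s^t), \\
&\qquad b^t_{0,l}(M_{j+1}|s^t), \ldots, b^t_{0,l}(M_N|s^t)\big) \times \sum_{\omega^{t+1}_j} O_j\big(x^{t+1}, a^t_j, C^{\nu(x^{t+1}, a^t_j, \omega^{t+1}_j)}, \omega^{t+1}_j\big)\, \Pr(m^{t+1}_j \mid m^t_j, a^t_j, \omega^{t+1}_j)
\end{aligned} \qquad (10)$$

where the probability over the configurations is computed
as in Algorithm 1 using the belief over models of all other agents
except $j$. In the end we add 1 to the count of action $a_0$ in
every configuration.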

Complexity of belief update:

For the traditional I-POMDP belief update, the complexity
of computing equation 3 is $O(|S||M^*|^N|A^*|^N)$ and
that for computing equation 4 is $O(|S||M^*|^N|A^*|^N|\Omega^*|)$,
where * denotes the maximum cardinality of a set for
any agent. For the factored representation, the belief update
operator invokes equation 3 for each value of all state factors
and it invokes equation 4 for each model of each agent
$j$ and for all values of updated states. Hence the total
complexity of belief update is
$O(K|X^*||S||M^*|^N|A^*|^N + N|M^*||S|^2|M^*|^N|A^*|^N|\Omega^*|)$.

In equation 5, Algorithm 1 is called once for all values of $s^t$.
The two inner summations iterate over all possible
configurations over transition and observation contexts. The
number of configurations is upper bounded by $\binom{N+|\nu^*|}{|\nu^*|}$, where
$|\nu^*|$ is the maximum cardinality of the frame-action
neighborhood for any context. Hence the complexity of
computing the updated belief of state factor $x^{t+1}$ using equation 5 is
$O\big(|S|K \times \{ N|M^*||A^*||\nu^*|\binom{N+|\nu^*|}{|\nu^*|} + \binom{N+|\nu^*|}{|\nu^*|} \}\big)$ (recall
the complexity of Algorithm 1). Similarly, the complexity
of computing the updated model probability using equation 6 is
$O\big(|S| \times \{ N|M^*||A^*||\nu^*|\binom{N+|\nu^*|}{|\nu^*|} + |\Omega^*|\binom{N+|\nu^*|}{|\nu^*|} \}\big)$. These
complexity terms are polynomial in $N$ for small values of
$|\nu^*|$ as opposed to exponential in $N$ as in equations 3 and 4.
The overall complexity of belief update is also polynomial
in $N$.

Proof of Proposition 1:

The expected reward of agent 0 is obtained as the sum of
reward factors.

$$ER_0(b^t_{0,l}, a^t_0) = \sum_{s^t} b^t_{0,l}(s^t) \Big( \sum_{k=1}^{K} \sum_{m^t_{-0}} b^t_{0,l}(m^t_{1,l-1}|s^t) \times \cdots \times b^t_{0,l}(m^t_{N,l-1}|s^t) \sum_{a^t_{-0}} \Pr(a^t_1 \mid m^t_{1,l-1}) \times \cdots \times \Pr(a^t_N \mid m^t_{N,l-1})\, R_0(x^t_k, \langle a^t_0, a^t_{-0} \rangle) \Big)$$

The complexity of computing the expected reward using the
above equation is $O(|S|K|M^*|^N|A^*|^N)$. Equation 8 is
derived similarly to the belief update by substituting the
distribution over frame-action configurations for distributions
over joint models and joint actions. This, combined with the
proofs for Eqs. 5 and 6, allows us to obtain Eq. 7 from the
Bellman equation of the original I-POMDP.

The complexity of computing the expected reward using
equation 8 is
$O\big(|S|K \{ N|M^*||A^*||\nu^*| + 1 \} \binom{N+|\nu^*|}{|\nu^*|}\big)$, which is again
polynomial in $N$ for low values of $|\nu^*|$.


